Applied Data Science (MAST30034)
Welcome
Welcome to Applied Data Science for 2021 Semester 2!
This is a capstone project subject, so expectations are higher than in most other subjects in your undergraduate course. It is expected that students have already completed assessments to a satisfactory level in the following subjects:
Elements of Data Processing (COMP20008)
Statistics (MAST20005)
Machine Learning (COMP30027)
Linear Statistical Models (MAST30025)
If you are unfamiliar with GitHub, it is in your best interest to revise how to use it or attend a consultation / revision workshop to learn.
Teaching Team
Your teaching staff will be as follows:
Lecturer: Dr. Karim Seghouane (Assignment 1)
Subject Coordinator: Akira Wang (Project 1 and 2)
Tutor: Yue You
Tutorial Structure
Tutorials are broken into Python and R streams to support students in whichever language they prefer.
The first hour of the tutorial will be based on general programming how-to’s and walkthroughs.
The remainder of the tutorial will generally follow a consultation / free-for-all style. That is, we can cover a topic of request out of the Advanced Tutorials module, answer project related questions, or ask questions about industry / applying for jobs.
You are free to attend any tutorial time, either half (or the full 2 hours) of the tutorial depending on your interests. You are all classified as experienced university veterans so do what works for you.
Finally, tutorial attendance is not marked for the duration of Project 1 and Assignment 1, but there is an expectation that you attend tutorials with your group for Project 2.
Lab 1 Overview
First Half
Using the Rstudio server:
Write in R markdown. Cheat sheet
Using GitHub Desktop vs Git CLI (Command Line Interface):
- Create a repository for your Project 1, push a commit, and ensure your repository accepts the changes. Click me for more information.
Project 1 Tips:
How to get started and what to look out for. Click me
Getting started on LaTeX with Overleaf. Overleaf tutorial
Second Half
Revision:
Variable names and types.
Pipe operator.
dplyr verbs in action.
Plotting geospatial maps.
Downloading files using urllib.
General Tips for R markdown
On Mac:
- Ctrl + Option + i : Insert code chunk {r}
- Cmd + Shift + c : Comment out lines #
- Shift + Enter : Run current cell
Using git on the VM
https://rstudio.mast30034.science.unimelb.edu.au/
Cloning:
Open a terminal (command-line git is required for this to work).
git clone HTTPS (where HTTPS is the https url to your gitlab repo).
Enter your credentials.
Done.
Pushing:
Change directories to inside your repository (cd NAME_OF_REPO_FOLDER).
git add . (this will add all files in the current directory to a commit - you can specify specific files if you would like instead).
git commit -m "message" (make a commit with a message).
git push
Enter your credentials.
Done.
Readable Code
We will be assessing the quality of your code and how you present it in your notebooks.
This is because there is no point writing code that cannot be easily interpreted. At the end of the day, clients are paying not only for your analysis but also for the corresponding code.
If your code is confusing or difficult to read, there is little chance your client will come back to you.
Variable Names
As long as you are consistent, it is fine. For example, commit to either using:
Snake Case: words are separated by underscores, such as variable_name
Camel Case: words are separated by capitals, such as variableName
Your variables should be contextual and describe the code. That is, try to name your variables to be understandable without comments.
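A small illustration in R (the variable names below are hypothetical):

```r
# Unclear: needs a comment to explain what x and y mean
x <- 12.5
y <- x * 0.1

# Clear: contextual snake_case names that describe the data without comments
fare_amount         <- 12.5
expected_tip_amount <- fare_amount * 0.1
```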
Let’s get started!
Install the tmap package (a library for thematic maps) and other required R packages
#install.packages("dplyr")
#install.packages("sf")
#install.packages("curl")
#install.packages("tmap")
Install ggmap
#install.packages("ggmap")
#OR
#install.packages("devtools")
#devtools::install_github("dkahle/ggmap")
Load libraries
Read in the data
## [1] "/mnt/student.unimelb.edu.au/yueyou/MAST30034_R/tutorials"
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1 2 1/12/15 0:00 1/12/15 0:05 5
## 2 2 1/12/15 0:00 1/12/15 0:00 2
## 3 2 1/12/15 0:00 1/12/15 0:00 1
## 4 1 1/12/15 0:00 1/12/15 0:05 1
## 5 1 1/12/15 0:00 1/12/15 0:09 2
## 6 1 1/12/15 0:00 1/12/15 0:16 1
## trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag
## 1 0.96 -73.97994 40.76538 1 N
## 2 2.69 -73.97234 40.76238 1 N
## 3 2.62 -73.96885 40.76453 1 N
## 4 1.20 -73.99393 40.74168 1 N
## 5 3.00 -73.98892 40.72699 1 N
## 6 6.30 -73.97408 40.76291 1 N
## dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax
## 1 -73.96631 40.76309 1 5.5 0.5 0.5
## 2 -73.99363 40.74600 1 21.5 0.0 0.5
## 3 -73.97455 40.79164 1 17.0 0.0 0.5
## 4 -73.99767 40.74747 1 6.5 0.5 0.5
## 5 -73.97559 40.69687 2 11.0 0.5 0.5
## 6 -74.01280 40.70221 1 20.5 0.5 0.5
## tip_amount tolls_amount improvement_surcharge total_amount
## 1 1.00 0 0.3 7.80
## 2 3.34 0 0.3 25.64
## 3 3.56 0 0.3 21.36
## 4 0.20 0 0.3 8.00
## 5 0.00 0 0.3 12.30
## 6 4.35 0 0.3 26.15
Variable names and types. (Comments)
Check the dimensions of the dataset.
## [1] 100000 19
## [1] "VendorID" "tpep_pickup_datetime" "tpep_dropoff_datetime"
## [4] "passenger_count" "trip_distance" "pickup_longitude"
## [7] "pickup_latitude" "RatecodeID" "store_and_fwd_flag"
## [10] "dropoff_longitude" "dropoff_latitude" "payment_type"
## [13] "fare_amount" "extra" "mta_tax"
## [16] "tip_amount" "tolls_amount" "improvement_surcharge"
## [19] "total_amount"
## [1] "list"
## 'data.frame': 100000 obs. of 19 variables:
## $ VendorID : int 2 2 2 1 1 1 2 2 2 2 ...
## $ tpep_pickup_datetime : Factor w/ 237 levels "1/12/15 0:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ tpep_dropoff_datetime: Factor w/ 536 levels "1/12/15 0:00",..: 5 1 1 5 9 16 2 8 17 10 ...
## $ passenger_count : int 5 2 1 1 2 1 6 2 1 2 ...
## $ trip_distance : num 0.96 2.69 2.62 1.2 3 6.3 0.63 1.91 4.5 1.42 ...
## $ pickup_longitude : num -74 -74 -74 -74 -74 ...
## $ pickup_latitude : num 40.8 40.8 40.8 40.7 40.7 ...
## $ RatecodeID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.8 40.7 40.8 40.7 40.7 ...
## $ payment_type : int 1 1 1 1 2 1 1 1 1 1 ...
## $ fare_amount : num 5.5 21.5 17 6.5 11 20.5 4 8 16.5 8.5 ...
## $ extra : num 0.5 0 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ mta_tax : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ tip_amount : num 1 3.34 3.56 0.2 0 4.35 1.06 1.86 3.56 2.45 ...
## $ tolls_amount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 7.8 25.6 21.4 8 12.3 ...
## pickup_latitude pickup_longitude
## Min. : 0.00 Min. :-77.05
## 1st Qu.:40.73 1st Qu.:-73.99
## Median :40.75 Median :-73.98
## Mean :40.14 Mean :-72.88
## 3rd Qu.:40.77 3rd Qu.:-73.97
## Max. :42.74 Max. : 0.00
## integer(0)
## pickup_latitude pickup_longitude
## 1st Qu.:40.73 1st Qu.:-73.99
## 3rd Qu.:40.77 3rd Qu.:-73.97
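The output above comes from standard base-R inspection functions. A self-contained sketch, using a toy data frame in place of the 100,000-row taxi data:

```r
# Toy data frame standing in for the taxi data (df in this tutorial)
toy_df <- data.frame(trip_distance   = c(0.96, 2.69),
                     passenger_count = c(5L, 2L))

dim(toy_df)      # number of rows and columns
names(toy_df)    # column names
str(toy_df)      # type and first values of each column
summary(toy_df)  # per-column numeric summaries
```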
Pipe operator.
Pipe operator: %>%. This operator pipes the output of one function into the input of another. Instead of nesting functions (reading from the inside out), piping lets you read the functions from left to right.
- Cmd + Shift + m : Insert pipe operator %>%
Example:
## VendorID trip_distance
## 1 2 0.96
## 2 2 2.69
## 3 2 2.62
## 4 1 1.20
## 5 1 3.00
## 6 1 6.30
In this case, we pipe the data frame to a function that selects two columns, then pipe the new data frame to head(), which returns the first rows of the new data frame.
## VendorID trip_distance
## 1 2 0.96
## 2 2 2.69
## 3 2 2.62
## 4 1 1.20
## 5 1 3.00
## 6 1 6.30
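As a side-by-side comparison, a self-contained sketch with a toy data frame (requires dplyr):

```r
library(dplyr)

toy_df <- data.frame(VendorID      = c(2L, 2L, 1L),
                     trip_distance = c(0.96, 2.69, 1.20))

# Nested form: read from the inside out
head(select(toy_df, VendorID, trip_distance))

# Piped form: read left to right -- same result
toy_df %>%
  select(VendorID, trip_distance) %>%
  head()
```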
dplyr verbs in action.
Filtering and variable selection.
Select a set of columns. Filter the rows with specified conditions.
## trip_distance pickup_longitude pickup_latitude
## 1 1.2 -73.99393 40.74168
## 2 3.0 -73.98892 40.72699
## 3 6.3 -73.97408 40.76291
## 4 0.0 -73.99016 40.75620
## 5 1.0 -73.99577 40.74379
## 6 1.8 -73.98841 40.76442
df %>%
filter(VendorID == 1 & passenger_count > 0) %>%
select(trip_distance, pickup_longitude, pickup_latitude) %>%
head
## trip_distance pickup_longitude pickup_latitude
## 1 1.2 -73.99393 40.74168
## 2 3.0 -73.98892 40.72699
## 3 6.3 -73.97408 40.76291
## 4 0.0 -73.99016 40.75620
## 5 1.0 -73.99577 40.74379
## 6 1.8 -73.98841 40.76442
df %>%
filter(VendorID == 2 & passenger_count > 100) %>%
select(trip_distance, pickup_longitude, pickup_latitude)
## [1] trip_distance pickup_longitude pickup_latitude
## <0 rows> (or 0-length row.names)
Arrange or re-order rows.
df %>%
select(VendorID, passenger_count, trip_distance) %>%
arrange(passenger_count, trip_distance, VendorID) %>%
head
## VendorID passenger_count trip_distance
## 1 2 0 0.0
## 2 1 0 0.7
## 3 1 1 0.0
## 4 1 1 0.0
## 5 1 1 0.0
## 6 1 1 0.0
Create new columns.
df %>%
mutate(pickup_posi = paste0("(",pickup_longitude,",",pickup_latitude,")")) %>%
select(pickup_posi) %>%
head
## pickup_posi
## 1 (-73.97994232,40.76538086)
## 2 (-73.97233582,40.76237869)
## 3 (-73.96884918,40.76453018)
## 4 (-73.99393463,40.74168396)
## 5 (-73.98892212,40.72698975)
## 6 (-73.97408295,40.76291275)
Create summaries of the data frame.
## avg_longi avg_lati
## 1 -72.8758 40.14371
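The single-row summary above was presumably produced along these lines (a sketch, assuming df is the taxi data frame loaded earlier and dplyr is attached):

```r
library(dplyr)

# Collapse the whole data frame into one row of column means
df %>%
  summarise(avg_longi = mean(pickup_longitude),
            avg_lati  = mean(pickup_latitude))
```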
Group operations.
df %>%
group_by(VendorID) %>%
summarise(avg_longi = mean(pickup_longitude),
avg_lati = mean(pickup_latitude))
## # A tibble: 2 x 3
## VendorID avg_longi avg_lati
## <int> <dbl> <dbl>
## 1 1 -72.3 39.8
## 2 2 -73.4 40.4
Cheat sheet available: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Download and view map
ggmap
The basic idea driving ggmap is to take a downloaded map image, plot it as a context layer using ggplot2, and then plot additional content layers of data, statistics, or models on top of the map. In ggmap this process is broken into two pieces – (1) downloading the images and formatting them for plotting, done with get_map, and (2) making the plot, done with ggmap.
The get_stamenmap() function requires a bounding box, i.e. the top, bottom, left and right latitude/longitude of the map you want to plot. For example, the bounding box for a US map is as follows:
bbox <- c(bottom = 25.75, top = 49 , right = -67, left = -125)
usmap <- get_stamenmap(bbox = bbox, zoom = 6, maptype = 'toner-lite')
ggmap(usmap)
The geocode_OSM() function geocodes a location (based on a search query) to coordinates and a bounding box, similar to geocode from the ggmap package. It uses OpenStreetMap Nominatim.
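For example, a minimal sketch (this performs a live query against OSM Nominatim, so it needs an internet connection; the query string is just an illustration):

```r
library(tmaptools)

# Geocode a free-text query to coordinates and a bounding box
nyc <- geocode_OSM("Times Square, New York")
nyc$coords  # named vector: x (longitude), y (latitude)
nyc$bbox    # bounding box around the matched location
```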
xranges <- range(df$pickup_longitude[!df$pickup_longitude==0])
yranges <- range(df$pickup_latitude[!df$pickup_latitude==0])
xranges
## [1] -77.04710 -71.06483
## [1] 37.27044 42.73614
map_big <- get_stamenmap(
rbind(xranges[1]+1,yranges[1]+2.5,xranges[2]-1,yranges[2]-1),
zoom = 8)
ggmap(map_big)
There are a lot of map tiles that you can use.
Now, let's try plotting something over the map!
Plot pickup locations.
ggmap(map_big) +
geom_point(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
colour = "blue", size = 2)
The equivalent for dropoffs.
ggmap(map_big) +
geom_point(data = df,
aes(x = dropoff_longitude,
y = dropoff_latitude),
colour = "red", size = 2)
ggmap(map_big) +
geom_point(data = df,
aes(x = dropoff_longitude,
y = dropoff_latitude),
colour="red", size = 0.05) +
geom_point(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
colour = "blue", size = 0.05)
Geospatial Inferences
More pickups around central Manhattan, with more dropoffs in the surrounding boroughs.
Pickup locations are easily divided into "hubs" (i.e. Manhattan, airports, etc.).
Dropoffs seem to be scattered across the map.
IMPORTANT: The above is at most describing the plot. Your project will require analysis and research on top of describing a plot. That is:
Why might there be more pickups around central Manhattan?
Is there an explanation surrounding the “hubs”?
Why are dropoffs scattered across the map?
As a suggestion, have less description and more analysis. Your visualisations should be easy to interpret and clearly visible (i.e. suitable font size, colour, alpha, legend, etc.).
Where to go from here
We have a simple visualisation on the pickups and dropoffs, but how might they be affected?
- Perhaps we can take a look at the time, day of week, the weather conditions, events that are taking place, etc. It is up to you to find an external dataset to answer these questions.
Data Serialisation
Feather:
Lightweight and super fast serialisation for data using Apache Arrow.
Python and R native, though not compatible with all data formats.
Medium space, Low time.
start.time <- Sys.time()
write_feather(df, paste0(filepath,"df.feather"))
end.time <- Sys.time()
round((end.time - start.time), 3)
## Time difference of 0.028 secs
start.time <- Sys.time()
write.csv(df, paste0(filepath,"df.csv"))
end.time <- Sys.time()
round((end.time - start.time), 3)
## Time difference of 1.585 secs
Read in.
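The corresponding read is just as simple (a sketch, assuming the feather library is attached and filepath is the directory used in the timing chunks above):

```r
library(feather)

# Read the serialised data frame back in
df_feather <- read_feather(paste0(filepath, "df.feather"))
```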
Downloading files
URL <- "https://github.com/YOU-k/MAST30034_R/tree/main/data/sample.csv"
destfile ="../data/download.csv"
download.file(URL, destfile)
SessionInfo
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] feather_0.3.5 tmaptools_3.1-1 tmap_3.3-2 ggmap_3.0.0
## [5] ggplot2_3.3.5 curl_4.3.2 sf_1.0-1 dplyr_1.0.7
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 tidyr_1.1.3 viridisLite_0.4.0
## [4] assertthat_0.2.1 sp_1.4-5 highr_0.9
## [7] yaml_2.2.1 pillar_1.6.1 lattice_0.20-41
## [10] glue_1.4.2 digest_0.6.27 RColorBrewer_1.1-2
## [13] colorspace_2.0-2 htmltools_0.5.1.1 plyr_1.8.6
## [16] XML_3.99-0.6 pkgconfig_2.0.3 raster_3.4-13
## [19] stars_0.5-3 bookdown_0.22 purrr_0.3.4
## [22] scales_1.1.1 jpeg_0.1-8.1 tibble_3.1.2
## [25] proxy_0.4-26 generics_0.1.0 farver_2.1.0
## [28] ellipsis_0.3.2 withr_2.4.2 leafsync_0.1.0
## [31] cli_3.0.0 magrittr_2.0.1 crayon_1.4.1
## [34] evaluate_0.14 fansi_0.5.0 lwgeom_0.2-6
## [37] class_7.3-17 tools_4.0.3 hms_1.1.0
## [40] RgoogleMaps_1.4.5.3 lifecycle_1.0.0 stringr_1.4.0
## [43] munsell_0.5.0 compiler_4.0.3 e1071_1.7-7
## [46] rlang_0.4.11 classInt_0.4-3 units_0.7-2
## [49] grid_4.0.3 dichromat_2.0-0 rstudioapi_0.13
## [52] rjson_0.2.20 htmlwidgets_1.5.3 crosstalk_1.1.1
## [55] leafem_0.1.6 bitops_1.0-7 base64enc_0.1-3
## [58] labeling_0.4.2 rmarkdown_2.9 gtable_0.3.0
## [61] codetools_0.2-16 abind_1.4-5 DBI_1.1.1
## [64] R6_2.5.0 knitr_1.33 utf8_1.2.1
## [67] KernSmooth_2.23-17 stringi_1.7.2 parallel_4.0.3
## [70] rmdformats_1.0.2 Rcpp_1.0.7 vctrs_0.3.8
## [73] png_0.1-7 leaflet_2.0.4.1 tidyselect_1.1.1
## [76] xfun_0.24
Comments and Docstrings
Cells in R markdown should aim to do one "block of logic" at a time (i.e. importing libraries, defining functions, filtering rows, etc.).
If it takes a reader more than a few seconds to understand your cell, you need comments.
Your functions need to have docstrings describing what they do.
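For example, one common style in R is a roxygen-style docstring above the function (the helper below is hypothetical):

```r
#' Compute the tip rate for each trip.
#'
#' @param tip_amount Numeric vector of tip amounts in dollars.
#' @param fare_amount Numeric vector of fare amounts in dollars.
#' @return Tips as a proportion of the fare.
tip_rate <- function(tip_amount, fare_amount) {
  tip_amount / fare_amount
}

tip_rate(1.00, 4.00)  # 0.25
```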